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ABSTRACT 

We study the problem of local alignment, which is finding pairs of 
similar subsequences with gaps. The problem exists in biosequence 
databases. BLAST is a typical software for finding local align- 
ment based on heuristic, but could miss results. Using the Smith- 
Waterman algorithm, we can find all local alignments in 0{mn) 
time, where m and n are lengths of a query and a text, respec- 
tively. A recent exact approach BWT-SW improves the complexity 
of the Smith- Waterman algorithm under constraints, but still much 
slower than BLAST. This paper takes on the challenge of designing 
an accurate and efficient algorithm for evaluating local-alignment 
searches, especially for long queries. In this paper, we propose an 
efficient software called ALAE to speed up BWT-SW using a com- 
pressed suffix array. ALAE utilizes a family of filtering techniques 
to prune meaningless calculations and an algorithm for reusing 
score calculations. We also give a mathematical analysis and show 
that the upper bound of the total number of calculated entries using 
ALAE could vary from 4.50mn M0 to 9.05mn ' 896 for random 
DNA sequences and vary from 8.28mn ' 364 to 7.49mn ' 723 for 
random protein sequences. We demonstrate the significant perfor- 
mance improvement of ALAE on BWT-SW using a thorough ex- 
perimental study on real biosequences. ALAE guarantees correct- 
ness and accelerates BLAST for most of parameters. 

1. INTRODUCTION 

Similar to web applications, another area that has recently wit- 
nessed a rapid surge in the amount of data being produced is the 
biosequence search. In this area, scientists often want to compare 
a biosequence against a collection of known sequences. General- 
ly, two biologically related sequences appearing dissimilar in their 
entirety may contain subsequences that are highly similar. 

Local alignment is a common technique for finding a pair of 
highly similar substrings from two given sequences, respectively. 
In querying biological sequences, search tools often distinguish 
between short queries and long queries [9]. For example, short 
queries (read) are used to find the same structural or functional 
subunits (motifs) from very different protein families or genomes; 
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large genomes or chromosomes, however, need to be compared in 
comparative genomics, such as aligning mouse genomes against 
human genomes [7, 12]. Generally, each biosequence can have the 
scale ranging from a few hundred million characters to a few billion 
characters and the length of a long query could be from a few thou- 
sand characters to ten million characters [7, 8]. Efficiently aligning 
long queries against biosequences poses a competitive challenge to 
the development of alignment tools. 

A variety of computational algorithms have been developed for 
finding local alignments, among which BLAST (Basic Local Align- 
ment Search Tool) [1, 2] and the Smith-Waterman algorithm [13] 
are typical ones. 

BLAST [1] is a popular tool for identifying the local alignments 
between sequences. It decomposes an input query into a set of 
grams and identifies matches against the database using grams of 
the query. A local alignment is created by examining the left and 
right subsequences from these matches. Although this heuristic 
approach suggests a time-optimized model, it does not guarantee 
to find all alignment results that meet the specified score criterion. 

The Smith- Waterman algorithm [13] is a well-known dynam- 
ic programming algorithm that could accurately identify the best 
local alignments between a query sequence and sequences in the 
database. It compares fragments of arbitrary lengths between two 
sequences and supports a flexible scoring scheme by allowing dif- 
ferent scores for different types of operations, including substitu- 
tion, insertion, and deletion. The score of an alignment is a sum- 
mation of the score of each operation involved in the alignment, 
which makes the algorithm sensitive and ensures an optimal align- 
ment of the sequences. However, this also has the effect that the 
method is very slow and CPU intensive. A recent approach called 
BWT-SW [8] is an exact method that improves the complexity of 
the Smith-Waterman algorithm under limited scoring scheme con- 
straints, but still much slower than the very efficient approximate 
method BLAST. 

This paper takes on the challenge of designing an efficient algo- 
rithm for evaluating local-alignment searches exactly. We improve 
the general dynamic programming algorithm by exploiting a fami- 
ly of filtering techniques and reusable calculations. The challenges 
and our contributions are as follows. 

(1) How to avoid calculating most of entries in dynamic pro- 
gramming matrixes without impairing the accuracy of the align- 
ment results? Calculating entries in matrixes is time consuming, 
especially when the text and query are long. We analyze the prop- 
erty of entries in the matrixes and propose a family of filtering tech- 
niques to avoid meaningless calculations in Section 3. Finding that 
there are many duplicate calculations in each matrix, we propose 
an algorithm for reusing those duplicates in Section 4. 
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(2) How to satisfy the space requirement of large biosequences 
for both the text and the query? We consider an in-memory algo- 
rithm and use the recent results on compressed suffix array to make 
our approach (called ALAE) doable in memory. The idea is similar 
to BWT-SW, but we adapt it to fit for our filtering techniques and 
reusing approaches (see Section 5). 

(3) What is the upper bound of the number of calculated en- 
tries? We give a mathematical analysis and prove that ALAE could 
provide a time efficiency guarantee across representative ranges 
of user specified schemes in Section 6. The upper bound of the 
total number of calculated entries using ALAE could vary from 
4.50mn ' 520 to 9.05mn 896 for random DNA sequences and vary 
from 8.28mn 0,364 to 7.49mn 0,723 for random proteins sequences. 

In addition, in Section 7 we show experimental results on real 
biosequence databases including DNAs and proteins to demonstrate 
the space and time efficiency of our ALAE approach. We show that 
ALAE makes a significant improvement of performance on BWT- 
SW for all scoring schemes and thresholds. ALAE also accelerates 
BLAST for most of scoring schemes and guarantees correctness. 

2. PRELIMINARY 

Let E be the alphabet of characters in biosequences. For a se- 
quence S of the characters in E, we use |S| to denote its length, 
S[i] to denote its i-th character (starting from 1), and S[i, j] to de- 
note the subsequence from its i-th character to its j-th character. 
The typical length of a genomic sequence is from millions to a few 
billion. In this section, we give the definition of local alignment. 

2.1 Local Alignment with Affine Gap Penalty 

Before formally defining local alignment, we present the wide- 
ly used scoring scheme for biosequences. In this scoring scheme, 
each identical mapping has a positive score s a , whereas a substitu- 
tion of one character has a negative score Sb, and a gap (insertion 
of r characters or deletion of r characters) has an affine gap penal- 
ty represented as a negative score s g + r x s s , where s g is a gap 
opening penalty represented as a negative score and s 3 is a gap 
extension penalty represented by another negative score for each 
insertion or deletion. We use {s a , Sb,s g ,s s ) to represent a scor- 
ing scheme and use the default scoring scheme (1,-3,-5,-2) in 
both BLAST and BWT-SW to show examples in this paper, which 
means s a = 1, Sb = —3, s g — —5, and s s — —2. 

The similarity between two sequences Si and S2 is defined as 
the value of the alignment of Si and S2 that maximizes total align- 
ment score, denoted sim(Si, S2). For example, let Si = AAACG 
and S2 = AACCG, then the optimal alignment of Si and 52 is to 
replace the third character A of Si with the third character C of S2, 
i.e. sim(Si , S 2 ) = 1 x 4 + (-3) = 1. 

Local alignment problem. Let T be a text sequence of n char- 
acters and P be a query sequence of m characters. For any 1 < 
7Ti < n and 1 < n p < m, compute the largest similarity between 
T[x,n t ] and P[y,n p ] (1 < x < ivt, 1 < y < 7T P ), i.e. the maxi- 
mum alignment score of any substring of T ending at position nt 
and any substring of P ending at position ir p . For biological ap- 
plications, we are only interested in those substring pairs if their 
alignment scores attain a threshold H 1 . 

One naive approach to find all of their local alignments is to ex- 
amine all substrings of T and align them one by one with P. Obvi- 
ously, we want to avoid aligning P with the same substring at dif- 
ferent positions of the text T. A natural solution is to build a suffix 
trie T of the text T as reported in [8]. Then, distinct substrings of T 

1 H could be determined indirectly using user specified expectation value 
i?-value. We discuss it in Section 7. 



are represented by different paths from the root to different nodes 
in the suffix trie. Let p u be a path from the root to a node u in the 
suffix trie T. We align each substring represented by p u against the 
query pattern P. 

Given a data sequence T, a query pattern P, and a threshold 
H, Algorithm 1 shows the BASIC algorithm for answering local 
alignment using the suffix trie T of T. According to the problem 
definition, we use A(i,j) to represent the alignment of T[x, i] and 
P[y,j] with largest alignment score (1 < x < i, 1 < y < j). The 
algorithm first initializes the largest score A(i, j). score — for 
each alignment A(i,j), and the starting position A(i,j).pos = 
(line 1). Let p be a suffix path from the root to a leaf node (line 
2). For each substring X represented by p, the BASIC algorithm 
searches prefix X[l, i] and finds all alignment pairs (X[l, i],P[y, j] 
whose similarities are greater than H (lines 3 - 5). Each prefix 
X [1 , i] corresponds with substrings at different positions 1 1 , . . . , tk , 
which means that the alignment scores of A(ti + i — 1, j), . . ., 
A(tk + i — 1, j) have the same score as sim(X[l, i], P[y, j]). For 
all alignments of X[l, i] and P[y,j] with the same end position 
t +i in T and j in P (ti < t < tk), we choose the largest alignment 
score among them. Let t be the starting position of the alignment 
with largest score. The algorithm sets A(t + i,j).pos = t (lines 6 
- 10). It finally returns all alignments with positive scores that are 
greater than or equal to the threshold H (line 11). 



Algorithm 1: BASIC - Calculating local alignments. 

Input: A suffix trie T of text T with n characters, a query pattern P 

with m characters, and a score threshold H; 
Output: End position pairs of local alignments; 

1 Initialize each alignment A(i, j). score = and A{i, j).pos = 
(l<i<n, l<j<m); 

2 foreach suffix path pfrom the wot to a leaf node in T do 

3 let X be the same substring representing by p; 

4 foreach prefix of X with i characters starting at positions 
ti, . . . ,tk do 

align X[l, i] against P and find all alignment pairs such that 
sim(X[l,i],P[y,j]) > H (l<y<j<m); 

6 foreach above alignment pair (X[l, i], P[y, j]) do 

7 foreach starting position of X[l, i] (ti<£h<tfc) do 
if sim(X[l, i], P[y, j]) > A(tb+i, j)- score then 

9 A(t h +i, j).score = sim(X[l,i],P[y,j]); 

10 A(t h +i, j).pos = t h ; 



11 return alignments A(i, j) if A(i,j). score > H (l<i<n, l<j<m); 



2.2 Dynamic Programming for Answering Lo- 
cal Alignment Exactly 

Given a query pattern P, for each substring X represented by a 
suffix path p from the root to a leaf node, we need to align each pre- 
fix X[l,i] (1<«<|A|) against P. Let M x (i, j) be the best align- 
ment score of X[l, i] and any substring of P ending at position j. 
We allow that any substring P[y, j] (l<y<j) is a potential match. 
We use an auxiliary score G a to store the best alignment score 
under the restriction that X [i] is aligned with a gap, and use anoth- 
er auxiliary score Gb(i,j) to store the best alignment score under 
the restriction that P[j] is aligned with a gap. 

Initial condition: 

M x (0,j) =0 for < j < m. 

M x (i, 0) = s g + i x s s for 1 < i < d. 

G a (0,j) = -oo for0<j<m. 

G b (i,0) = -00 for 1 < i < d. 
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Recurrences (for i > l,j > 1) 

r M X (i-i,j-i) + s(x[i\,p[j]), 

M x (i,j) = max < G a (i,j), \ , where 

{ G b (i,j) 

G„(i, j) = max{G (i-l, j) + s s , M x (i-l,j) + (s g + s s )}. 
G b (i,j) = max{G;,(i, + s s , M x (i, + (s B + s s)}- 
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Figure 1: An example of calculating local alignment score (bold 
values represent Mx(i, j)). 

Fig. 1 shows an example of the dynamic programming of align- 
ing a substring X=GCTA against a query P=GCTAG. For example, 
M x (4, 3) =max{M x (3, 2) + <5(X[4], P[3]), G„(4, 3), G„(4, 3)}. 
Since X[4] does not equal to P[3], 6(X[4],P[3]) = s b = -3. 
G a (4, 3) =max{G a (3, 3)+s 3 , M x (3, 3)+(s s +s s )} =max{-ll 
+ (-2), 3 + (-5 - 2)} = -4 and G 6 (4, 3) = max{G i) (4, 2) + s s , 



M x (4, 2) + (s 9 + s s )} = max{-17 + (-2), -7 + (-5 - 
-14. Therefore, M x (3,4) = max{-5 + (-3), -4, -14} 
The variables used in this paper are shown in Table 1 . 



•2)} = 
-4. 



Table 1: List of variables and their notations. 



Variables 


Notations 


T 


A sequence (a.k.a. a text). 


P 


A query pattern. 


n 


Length of T. 


m 


Length of P. 


X 


A substring represented by a suffix path from the root to a 
leaf node in the suffix trie of T. 


M x 


A matrix for a substring X of T and a query P. 


M x (i,j) 


Best alignment score of X[l, i] and P[y,j], where l<y<j. 


A(i,j) 


The alignment with the largest alignment score of T[x, i] 


and P[y, j], where l<x<i, 1<J/<J. We use A(i, j). score 
to express the largest alignment score, and A(i, j) .pos to 
express the starting position x in T. 



In the remaining of this paper, we will focus on local alignment 
on whole texts only. Notice that, the techniques can be immedi- 
ately applied to collections of sequences: given all the sequences 
Ti , . . . , T„ in the database, we concatenate them into a single se- 
quence T. A local alignment query is then performed directly on 
the sequence T. 

2.3 Suffix Trie and Compressed Suffix Array 

Suffix Trie. Let T be a text with n characters. The suffix trie of the 
text T is a trie whose edges are labeled with strings, such that each 
path from the root of the trie to a leaf represents exactly one suffix 
of T. Each leaf node stores the starting location of the correspond- 
ing suffix of T. 

Compressed Suffix Array. Compressed suffix array [5] is a com- 
bination of the Burrows-Wheeler compression algorithm [3] and 
the suffix array [10]. In [3], Burrows and Wheeler propose a new 
compression algorithm based on a reversible transformation, called 



BWT, which transforms a text T into a new string that is "easy to 
compress." BWT appends a special symbol $ smaller than any oth- 
er symbol of E at the end of T. Let the position of $ be n + 1. 
For example, given a text T = GCTAGC, we append $ at the end of 
T and get T" = GCTAGC $. Then the BWT transformation of T" is 
CTGGA$C. 

The suffix array SA[0, n] of T' is an array of indexes such that 
SA[i] stores the starting position of the i-th lexicographically s- 
mallest suffix. For example, SA of GCTAGC $ is {7, 4, 6, 2, 5, 1,3}. 
The space occupancy of the compressed suffix array is optimal in 
an information-content sense. 

The compressed suffix array can support effective searches for 
arbitrary patterns [5]. Given a substring X, we can use the back- 
ward search algorithm [6] to identify the SA range of X in C(|X|) 
steps. In particular, it processes the last character c of X in the first 
step. It looks at c as a string S. Let [i,j] be the SA range of S. 
Then it processes the string xS by iteratively inserting one charac- 
ter x before S in X. The backward search algorithm shows that 
each step could be done in constant time. For any string X, if there 
exists an SA range of X, say [i,j] in SA, the starting positions 
of X in T can be found in SA[i], SA[i + 1], . . ., and SA[j}. For 
instance, the SA range of a substring GC is [4, 5], then the starting 
positions of GC in T are 5 and 1. 

In Section 5 we show how to simulate traversals of a suffix trie 
T using the compressed suffix array. 

2.4 Related Work 

There are a large amount of techniques on supporting local align- 
ments, such as [4, 11, 13, 14]. 

The Smith- Waterman algorithm supports slow but formally cor- 
rect local-alignment searches and guarantees users the optimal lo- 
cal alignments between query and database sequences. It requires 
0(nm) time complexity, which is a considerable disadvantage. 

OASIS [11] employs a dynamic programming A*-search which 
is driven by traversing a suffix tree index constructed on the database 
sequences. It can accurately find local alignments and outperform- 
s both BLAST and the Smith- Waterman algorithm only when the 
query sequences are very short (less than 60 characters). 

BWT-SW is a recently proposed exact method for finding all lo- 
cal alignments. It uses a BWT index to emulate the suffix trie of T 
and modifies the dynamic programming (i.e. the BASIC algorithm) 
to allow pruning but without missing any results. BWT-SW travers- 
es the suffix trie in preorder and provides an early-termination tech- 
nique by ignoring all negative alignment scores. Each path from the 
root to an intermediate node u represents multiple substrings of T. 
It shows that for any a path from the root to an intermediate node 
u, if the matrix indicates that there is not any substring of the query 
pattern having a positive score when aligned with the path, then 
BWT-SW can safely prune the subtree rooted at u away. Given the 
fixed scoring scheme (1, —3, —5, —2), the expected running time 
is O(mn ' 62S ) for random strings and the total number of calculat- 
ed entries is upper bounded by 69mn 0,628 . In addition, BWT-SW 
requires that \ sb\ > 3|s a j, which highly limits its usability. 

Although BWT-SW improves the time complexity of the Smith- 
Waterman algorithm under the constraint \sb\ > 3\s a \, it is still 
much slower than the approximate method BLAST. By compar- 
ing the Smith- Waterman and BWT-SW algorithms with BLAST 
we find that BLAST gets notable differences in accuracy and speed 
with the former two algorithms. However, BLAST is an approxi- 
mate approach that could not guarantee to find all local alignments 
even though BLAST is indeed accurate enough in most cases [8]. 
In this paper, we propose a new approach ALAE to find all align- 
ments with a comparative speed. 
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3. AVOIDING MEANINGLESS CALCULA- 
TIONS 

Ideally, we hope to only calculate the alignment scores that can 
generate the optimal alignments. For an entry whose alignment 
score is impossible to be an optimal alignment score, we call it 
meaningless, otherwise, we call it meaningful. 

In this section, we propose a family of filtering techniques to 
prune meaningless entries. In Section 3.1, we show that some en- 
tries in a single matrix are meaningless and we propose local filter- 
ing techniques to prune those meaningless ones. In Section 3.2, we 
show that an entry in a matrix is meaningless if its alignment score 
has been calculated in some other matrixes, and we propose global 
filtering techniques across matrixes. 

3.1 Local Filtering 

We propose three local filtering techniques to prune meaningless 
entries in a single matrix for a substring X and a query P. 

3.1.1 Length Filtering 

Given a query P and a score threshold H, the BASIC algorithm 
aligns each substring represented by a suffix path against P. In this 
section, we show that we only need to align substrings of T with 
certain lengths against P. 

THEOREM 1. Length filtering. Given a query P with m char- 
acters, a substring X ofT, and a score threshold H. The 
entry of M x is meaningless ifi does not satisfy the following con- 
dition: 

\ — \ < i < max{m, m + [ — J}. (1) 

We use L m ax to express the length upper bound max{m, m + 

^ H — (s a xm+s g ) j i 

PROOF. Remind that Mx(i,j) is the score of aligning X[l, i] 
against a substring P' of P ending at position j in P. Let the length 
of the substring P' be h. 

When i<h, there are at most i matches and the maximum pos- 
sible score of Mx(i,j) is s a xi. Since we are only interested in 
M x (i,j)>H, we get s a xi > M x (i,j) > H, i.e., < i < h. 

When i>h, there are at most h matches and at least i — h gaps. 
Therefore, the maximum possible score of Mx(i,j) is s a xh + 
s 9 + s s x (i — h). As we have mentioned above, we are only in- 
terested in Mx{i,j) > H, we get s a xh + s g + s s x (i — h) > 
Mx(i,j) > H. Since s 3 < 0, s a — s s > 0, and h < m, we get 
i< s * +(s *T s ° )Xm ~ g ,i.e.,fr<i<m + [ 'H^ttl l, 

Accordingly, i should be either in the interval [\^-~\ , h] or the 

interval (h, m + [ " ff ~^" s xm+3g) j]. Thus, Equation 1 holds. □ 

For example, given a text T=CTAGCTAG, a query P=GCTAC, 
and let the threshold H = 3. We only need to consider substrings 
of T with length in between 3 and 4. 

3.1.2 Score Filtering 

We could early terminate the calculation of a score M x (i,j) 
if we know for any score Mx(i" based on Mx(i,j) (i" > 
> j)< Mx(i",j") is impossible to attain the threshold H. 

THEOREM 2. Score filtering. For any substring X[l,i] start- 
ing at position ir t in T (1 < ir t < n), the (i,j)-entry of Mx is 
meaningless if: 

f o, 

Mx(i,j) < max < H - (m - j) x s a - 1, 

I H — (mm{L max , n — n t } — i) x s 



PROOF. We are only interested in those alignments scores >H. 
Let M x (i,j) be the score of X[l, i] and P[y,j] (1 < y < j). 

(i) BWT-SW shows that M x (i,j) must be greater than 0. Let 
M x {i',j') be the score of X[l,i'] and P[y,f] (i' < i, j' < 
j). Assume M x (i',j')<Q, then the score of the alignment be- 
tween and P[j'+l,j] must be greater than or equal to 
Mx According to the definition of local alignment problem 
in Section 2, the alignment between X[l, i] and P[y,j] could not 
be the best alignment, therefore, the -entry in this matrix is 
meaningless. 

(ii) Now we consider an alignment score Mx in the ma- 
trix Mx. Let C be the alignment score of X[i + 1, i"} and P[j + 

then M x (i",j") = M x {i,j) + C > H. If M x (i, j) < 
H - C - 1, then the alignment of X[l, i"] and P[y, j"} could not 
be the answer, thus it is meaningless to calculate the (i, J)-entry. 

As we know, the only way to increase an alignment score is by 
a match. The largest possible value of C is (j" — j) x s a or 
(mm{L m ax, i"} ~ i) x s a since there are at most (j" — j) or 
(min{L max , i"} — i) matches between X[i + and P[j + 

1, j"]. As we know, the maximal j" is m and the maximal i" is n — 
TT t , then the upper bound of C is (m — j) x s a or (min{L maa; , n — 
irt} - i) x s a . Therefore, when M x (i,j) < H - C - 1 < 
max{H—(m—j)xs a — i, H — (mm{L m a. x ,n—irt}—i)xs a — l}, 
the (i, j)-entry of M x is meaningless. □ 

For example, given a text T=CTAGCTAG. Let X=GCTA be a 
substring of T, and let the threshold H = 3. Consider the matrix 
Mx in Fig. 1. All the entries with negative scores are meaningless. 
The (1, 5)-entry is meaningless, since the lower bound of the score 
for the 5-th column must be 3, but the calculated Mx(l, 5) = 1. 
The lower bound of the scores for the 4-th row is 3. Therefore, 
among the 30 entries in Fig. 1, only four entries (1, 1), (2,2), 
(3, 3), and (4, 4) are meaningful according to score filtering. 

3.1.3 Prefix Filtering 

Consider a suffix path p in the suffix trie T of the text T. Let X 
be the substring represented by the path p. We start from the first 
character of X and align each prefix X[l, i] (1 < i < Lmax) in 
the query P. According to score filtering, we are only interested in 
positive alignment scores. According to the scoring scheme, only 
an identical mapping has a positive score s a . Therefore, for an 
alignment of X[l, i] and P[y, j], there must exist an integer q such 
that X[l, q] exactly matches P[y, y + q— l](l<y<m — q + 
1) to make their alignment score large enough to counteract the 
effect of a mismatch or a gap. Equation 2 defines the length value 
q according to the scoring scheme. 



q = ^ min{M,|s g + s»|} j + 1 



(2) 



,) 



We call the substring X[l, q] q-prefix for the suffix path p. Based 
on the observation above, we present prefix filtering in Theorem 3. 

THEOREM 3. q-Prefix filtering. Let Mx(i,j) be the alignment 
score ofX[l,i] and P[y,j] (l<y<j). The (i,j)-entry of Mx is 
meaningless if X[l, q] does not match P[y, y+q— 1] exactly. 

Furthermore, for a substring X of T, if we could not find an 
exact match between X[l,q] and a substring of P, entries of the 
whole matrix for X and P are meaningless. For instance, consid- 
er a substring A=ACACAT and a query P=GCGTGTGA under the 
scoring scheme (1,-3,— 5,— 2). All entries in the matrix for X 
and P are meaningless since we could not find an exact match of 
X[l, q] in P, where q = 4. 
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In order to find the exact match of X[l, q] in P efficiently, we 
build inverted lists of g-grams of P on the fly. We decompose P 
into a set of g-grams by sliding a window of length q over the char- 
acters of P. For each g-gram in P, we generate an inverted list of 
its start positions in P. The time complexity of building inverted 
lists is 0(m). 

According to Theorem 3, for each starting position tt p of X[l, q] 
in P, there must exist a. fork area that entries outside of this fork 
are meaningless. Fig. 2 shows the sketch of a fork. The rectangle 
in the figure represents a matrix Mx for X and P. Each fork in the 
matrix consists of three regions: an exact match region (denoted 
EMR), a no gap region (denoted NGR), and a gap region. We call 
an entry (I, I + ir p — 1) a first gap open entry (FGOE for short) if 
the entry satisfies the following two conditions: 

(i) M x {l,it p + l-l) > |s s + s s |, and 

(ii) for each i < I and j = tt p + i — 1, Mx(i,j) < \s g + s a \. 
An FGOE belongs to an NGR and it is a point switch from the 

NGR to a gap region. From the FGOE (I, tt p + I — 1), we need to 
calculate another two extension entries (I, tt p + I) and (I + 1, -k p + 
I — 1). The shape of a gap region can be determined by a set of 
extension entries. Each extension entry is represented by a concave 
corner point such that its score is greater than \ s s \. 



case where both X' and X only appear once in T (see Fig. 3). Ac- 
cording to the BASIC algorithm, we need to construct two matrixes 
M x > and Mx- The following two cases show that calculating a 
fork area starting from (1, j) -entry in Mx is meaningless. 
Case (1) : X[l, q] does not match P[j,j + q — 1]. According to 
our analysis in Section 3 . 1 . 3 , the ( 1 , j ) -entry of Mx is mean- 
ingless; or 

Case (2) : X[l, q] matches P[j, j+q—1], and the alignment score 
A (t+i, j). score > s a . Since there is an identity mapping be- 
tween X[l] and P[j], Mx(l,j) must be equal to s a . When 
the matrix M x > makes A(t+i,j). score > s a , we could ig- 
nore calculating Mx(l, j). 



y j j' 



t 




t+i \t+i' 








X- 

— X — 





Figure 3: For the same alignment A(t + using M x > can 

produce higher score than using Mx if M x > > s a , where 
M X ' is associated with the alignment A(t + 




Figure 2: Entries outside of fork areas are meaningless. 

Given a fork starting from the entry (1, tt p ), an entry (i,j) be- 
longs to its EMR if 1 < i < q and j = it p + i — 1. An entry 
belongs to its NGR if q < i < I and j = tt p + i — 1. 

According to Theorem 3, for each entry in an EMR, its 

score Mx(i,j) = i; for each entry in an NGR, it is no need 
to consider the auxiliary scores in G a or Gb, so we can simplify the 
recurrence function of Mx in Section 2.2 as follows: 

M x (i,j) = M x (i - 1, j - 1) + S(X[i],P[j]). (3) 

3.2 Global Filtering 

Basically, given a suffix trie T of text T with k paths, we need 
to calculate k matrixes to align a substring represented by each suf- 
fix path against the query pattern P. For each matrix, we need to 
calculate entries inside its forks. A natural question is whether we 
can safely avoid calculating a certain fork in a matrix based on the 
results of calculated matrixes. The answer to this question is yes if 
we could find a "good" order of calculations for suffix paths in T. 

In this section, we first analyze the effect of a calculated matrix 
and use bitwise operations to dynamically update and check mean- 
ingless calculations. We then show there exists dominate relation- 
ships between g-prefixes in T and observe that a fork area could be 
safely pruned using g-prefix domination. We show how to find the 
good order of calculations based on g-prefix domination and give a 
space-efficient approach to do global filtering for a large text. 

3. 2. 1 Meaningless Fork Areas 

Let X' be a substring starting at position t in T and X be the suf- 
fix of X' starting at position t + i in T. We first consider the simple 



We extend the above analysis to the general scenario that X 
might have multiple occurrences in T. 

THEOREM 4. Let X be represented by the suffix path p u . Let 
ti, . . . ,tk be the starting positions of X. The (1, j)-entry of Mx 
is meaningless if the following two cases hold: 
Case (1) : X[l, q] does not match P[j, j + q—1]; or 
Case (2) : X[l, q] matches P[j, j + q — 1], and each alignment 
score A(th,j).score > s a (1 < h < k). 

Update and check meaningless calculations on-the-fly. According 
to Theorem 4, we can avoid calculating the fork area starting from 
the (1, j) -entry in the matrix Mx- In order to do it, a simple way 
is to construct an n x m boolean matrix G as follows. A (n t , 7T P )- 
entry of G is 1 if A(ir t , tt p ). score > s a , otherwise, it is 0. 

Consider a matrix Mx and its (1, j) -entry. If there exists a sub- 
string X' and M X '(i,j) > s a , we construct a column vector z. 
Let entries z[th + i — 1] be 1 (i > l,ti < th < tfc) and the 
remaining entries be 0. The (l,j) -entry of Mx is meaningless if 
the bitwise AND operation between the j-th column of G and z 
equals to z. Then the fork area starting from this (1, j) -entry does 
not need to be calculated. If the above bitwise AND operation does 
not equal to z, we update the j-th column of G by doing bitwise 
OR operation between the j-th column of G and z. We repeat the 
above process until all suffix paths have been processed. 

For example, given a text T=GCTAGCTA. Let X'=GCTA be a 
substring of T. Suppose we have calculated the matrix shown in 
Fig. 1 to align GCTA against the query P=GCTAG. Based on this 
matrix M x >, we generate the following boolean matrix G. 

/ 1 1 \ 

10 

10 

1 

1 
10 
10 

V o o o i o / 8x5 

For another substring X=CTAG, before constructing its matrix 
Mx, we find an exact match between X[l, g]=CTAG and P[2, 5]. 
We then check if the (1, 2)-entry of Mx is meaningless. We con- 
struct a column vector z = (0, 1,0,0,0,0,0,0) and make a bitwise 
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AND operation between the second column of G and z. The result 
of this bitwise AND operation equals to z, since X appears once in 
the text T and it associates with one alignments A(l, 1). Therefore, 
(1, 2)-entry of Mx is meaningless. 

3.2.2 Prune Meaningless Forks using q-Prefix Dom- 
ination 

The online approach in Section 3.2.1 requires n x m space to 
store the matrix G, which is space consuming especially when both 
the lengths of the text and the query are large. The question is 
whether there exists a "good" order of calculations to avoid using 
this matrix G. In this section, we show that the answer is yes. 

Based on the analysis in Section 3.1.3, we know that each fork 
must be started at an exact match between the q-prefix of X and a 
substring of P. That is, given another substring X', for any fork in 
Mx', there must exist at least q entries with scores greater than or 
equal to s a . We could use this property to define the order of the 
calculations. We formally define this property below. 

DEFINITION 1. Let p and p%, . . . ,pk be suffix paths from the 
root to leaf nodes in the suffix trie ofT. Let X q and X\ , . . . , Xj? be 
q-pre fixes represented by p andpi, . . . ,pk, respectively. If for each 
appearance of X q at position t, we can always find an appearance 
of a q-prefix in {Xf , . . . , X^} at position t — 1, we say each Xf 
q-dominates X q , denoted Xf >- X q (1 < i < k). 

LEMMA 1 . Given a text T and a query pattern P. Let P [j, j + 
q — 1] and P[j — 1, j + q — 2] be two substrings with length q. It is 
meaningless to calculate the alignment score A(x,j) if one of the 
following conditions holds: 

• We could not find a substring X whose q-prefix exactly match- 
es P[j,j +q- 1]; 

• We could find two substrings X and X' such that X[l, q] — 
P[j, j + q- 1], X' [l,q]=P\j-l,j + q- 2], and X' y X. 

Constructing dominations offline. According to Lemma 1, we need 
to find all dominate relationships among q-length substrings of the 
text T to filter meaningless calculations. We preprocess the text 
T and construct dominations offline in 0(n) time as follows. We 
start from the first character of T and scan the whole text. For 
any two substrings Xf and Xf at position i and i + 1 with q- 
length, respectively, we construct dominate relationship Xf y Xf . 
We require that the q-length substring at position 1 could not be 
dominated by any other q-length substrings. 

Check meaningless calculations on-the-fly. Given a query P, we 
build up g-gram inverted lists on-the-fly as discussed in Section 3.1.3 
For each position j in each gram list, we search P[j,j + q — 1] 
and P\j — l,j + q — 2] in the text T (We show a space-efficient 
approach to simulate traversals of the suffix trie T using com- 
pressed suffix array in Section 5). If we can find exact matches 
XI = P[j,j + q-l] andXf = P\j-l,j+q-2],wdX q y Xf, 
we ignore calculating M X f (1, j)- 

The approach is sound that there cannot be any false dismissals. 
Using this approach, however, we could not guarantee to filter all 
meaningless forks since it also depends on the online calculated 
alignment scores. 

4. REUSING SCORE CALCULATIONS 

There might exist duplicated substrings in both T and P. This 
situation is even going further when considering T can reach length 
of a few gigabytes and P can reach length of several megabytes. 

Obviously, we do not want to align a substring of T against a 
substring of P more than once if they have been aligned before. 



The BASIC algorithm in Section 2 represents a distinct substring 
X starting at positions t\, . . . , tu in T using a suffix path from the 
root to a leaf node in the suffix trie of T. In this way, X is needed 
to be calculated only once and the alignments A(ti + i,j), ■ ■ ., 
A(tk + i,j) can share the same alignment score Mx(i,j). 

A natural question is whether we could share previous calculat- 
ed scores on duplicated substrings in P to speed up the alignment 
process. In this section, we show the techniques of reusing score 
calculations between forks. We first analyze the relationship be- 
tween scores of duplicated substrings in P. We then show how to 
identify duplicated substrings in P that can be reused and present 
an algorithm to reuse score calculations efficiently. 

4.1 Duplicate Alignment Scores 

Consider a matrix Mx shown in Fig. 4. There are two forks 
starting from (1, 7ri)-entry and (1, 712) -entry, respectively. P s is 
the common prefix of P[ni, m] and P[tt2, m]. We mark the en- 
tries with common prefix P a using black color. In the black areas, 
two alignment scores Mx (*, 7i~i + s) and Mx (i, ^2 + s) are equiv- 
alent (0 < s < \P S \), since the substring P[ni,ir 1 + s] equals to 
P[n2, 7T2 + s] and they must find the same alignment against X. 




Figure 4: Entries with a common prefix P s can share alignment 
scores. (Black areas represent reusable alignment entries.) 

LEMMA 2. Given a matrix for a substring X. Let P[iti, tti + 
q — 1] and P[n2, K2 + q — 1] be two substrings that match X[l, q] 
exactly. Let P s be the common prefix of P[ni, m] and P[n2, m]. 
For any two alignment scores Mx + s) and Mx {i, ^2 + s), 
we say Mx(i, + s) equals to Mx(i, IT2 + s )> ifO < s < | i-* s |. 

Fig. 5 shows another case that alignment scores in two forks 
could be duplicate. Assume the substrings P[ni , ji — 1] and P[tt2, 
32 — 1] in the figure do not match. Instead, we could find a common 
substring P a starting from the two FGOEs. Theorem 5 shows that 
the two black areas with common substring P s could share align- 
ment scores if the scores of the two FGOEs are equivalent. 




Figure 5: If two forks have equivalent scores for their FGOEs, 
their entries with common substring P s can share alignment 
scores. (Black areas represent reusable alignment entries.) 

THEOREM 5. Let fx and fi be Two forks starting from (1, -Re- 
entry and (1, -K2)-entry in a matrix Mx, respectively. Let (i\ ,ji)- 
entry and (12, j2)-entry be the FGOEs of fi and f2, respectively. If 
ii equals to 12, then the alignment score Mx (i\ , ji ) must be equal 
to the score Mx (22, J2 )• 
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For example, let A=GCTACCCCCTTTGGAA, q=4, P[ki, ji - 
1]=GCTACACCCTTT and P[n 2 , ji - 1]=GCTACCTCCTTT. Since 
the g-prefix X[l, g]=GCTA matches P[ni, ni+q— 1] and P[7r 2 , tt2 + 
q — 1], there are two forks starting from (1, 7Ti) -entry and (1, -K2)- 
entry in a matrix Mx, respectively. Although P[ni, ji — 1] and 
P[^2,j2 — 1] do not match, using Equation 3, their FGOEs have 
the same score M x (12, 7Ti + 11) = M x (12,7r 2 + 11) = 8. 

LEMMA 3. Suppose there are two fork areas f\ and $2, and 
their FGOEs are (i,ji) and (i,j 2 ) respectively. Let P a be the 
common prefix ofP\j\, m] and P[j 2 , m], then VO < s < \P a \ and 
i' > i, Mx{i' ,ji + s) equals to Mx{i' ,J2 + s). 

4.2 Identify Duplicates in a Query 

Reexamine the two cases described in Figs. 4 and 5. The reusable 
entries belong to two parts: no gap regions and gap regions. If 
an entry (i, j) belongs to a no gap region, we could use Equa- 
tion 3 to calculate the score Mx(i,j)- If an entry belongs 
to a gap region, however, we have to first calculate the other two 
auxiliary scores G a (i,j) and Gb(i,j), and then choose a maxi- 
mal value among G a (i, j), Gb(i,j), and the score calculated using 
Equation 3. It takes more time to calculate a score in a gap region 
than in a no gap region. Therefore, we focus on reusing scores of 
entries in gap regions in this section. 

Consider a matrix Mx that contains k forks. Let . . ., and 

(i, jk) be FGOEs of these forks respectively. When calculating the 
alignment score of the (i, j) -entry in the gap region in a fork with 
FGOE (i, j w ) (ji < j w < jfc-i), we need to record the substring 
P[jw,j] so that whenever we meet a duplicate of this substring 
starting from the next position j w +i in P, we could reuse scores of 
entries in between columns j w and j to entries in between columns 
jw+i and j + j w +i — j w . In order to do it, we need to identify du- 
plicates among any two substrings P[j u ,m] and P[j v ,m], where 
ji < ju,jv < jk and j u j v . 

A straightforward approach is to build up a path for each suffix 
P[j w ,m]. For each node u in the tree, we merge its child nodes if 
they have the same input edge from u. However, this approach is 
both time and space consuming. We could build up a common pre- 
fix tree Tp B in linear time on-the-fly (see Algorithm 2). Instead of 
process each suffix, we use a set of disjoint substrings {P[ji , J2 — 
1], P[j2,j3 — 1], ■ ■ ■, P\jk, m]} to construct Tp s , since each suffix 
P[j w , m] can be assembled by concatenating P[j 

W 1 JW-\-l -M) ■ ■ 

und P[j k ,m}. 

The algorithm CONSTRUCTCPTREE first initializes T Ps using a 
root node root (line 1). It then inserts each substrings' = P[ji,j i+1 
— 1] into T Ps . The algorithm CONSTRUCTCPTREE checks if root 
of Tp s has an outgoing edge. If there is no outgoing edge from 
root, it directly creates a child node c of root and labels the edge 
between root and c using S. Notice that, S is only the prefix of 
P[ji,m]. When processing P[ji+i, ji+2 — 1], we need to concate- 
nate S to corresponding leaf nodes in Tp s . The algorithm sets a link 
from root to c to mark such a leaf node, which need to be processed 
for succeeding substring (lines 4 - 5). Otherwise, the algorithm 
searches P[ji,ji+i — 1] in Tp s until it reaches a node u with lev- 
el I such that the substring S[l, I] is represented by path(root, u) 
(line 7). If there exists an outgoing edge that can match a substring 
S[l + 1, 1'] from I + 1 in S, then the algorithm splits the edge into 
two substrings S[l + 1, /'] and S[l' + 1, \S\] by inserting a new n- 
ode c'; otherwise it creates a new child node c of u and labels the 
edge between u and c using S[l + 1, \S\] (lines 8-12). Finally, the 
algorithm concatenates S to all leaf nodes marked by links (lines 
14 - 16). The algorithm returns the root node of Tp s . The time 
complexity of constructing a common prefix tree for k substrings 
is(D(fc+f). 



Algorithm 2: constructCPTreeQ 



Input: A query P, a vector of column ids F v ; 
Output: Root of the common prefix tree T Pg ; 

1 Initialize a common prefix tree Tp s using a root node root; 

2 foreach 1 < i < k — ldo// process P[j w ,j w +i — 1] 



9 
10 

11 
12 

13 
14 
15 
16 

17 



S = P\j w ,jw+i - 1]; 

if root ofTp 3 has no outgoing edge then 
L Create a node c; edge(root, c) = S; link(root) = c; 

w = link(root); 

Find a deepest node u such that the prefix S" = S[l, I] 

(1 < I < \S\) can be represented by path(root, u); 

if there exists a node v such that the prefix of edge(u, v) matches 

S[l + l,V] (V < IS]) then 

Split edge(u, v) by inserting a node c'; 
Create a node c; edge(c' , c) = S[l' + 1, |S||; 

else 

|_ Create a node c; edge(u, c) = S[l + 1, |S||; 

link(root) = c; 
while w ^ null do 

Set a temp link node v=w\ 

Create a node c; edge(w, c)=S; w = link(v); link(v) = c; 
return root; 



For example, let P=CACGTATACG and assume j 1 = 2,j 2 = 
4, j3 = 6, ji — 8. The constructed procedures for each inserted 
substring P\ji,ji+i — 1] (1 < i < 4) are shown in Fig. 6. Fig. 6(a) 
shows the tree that only contains a substring P[ji,j2 — 1]=AC. 
When inserting the second substring P [72, jz — 1]=GT, the function 
CONSTRUCTCPTREE creates a node of root and let the edge from 
root to this created node be GT since GT has no common prefix with 
substrings in the tree shown in Fig. 6(a). It then processes nodes 
pointed by links from the root and concatenate GT to the leaf node 
under AC and modifies links as shown in Fig. 6(b). Fig. 6(c) shows 
the common prefix tree after inserting P\jz,jn — 1]=AT. Since AT 
and the edge from root to its left child AC has a common prefix A, 
the function needs to split the edge AC into A and C by inserting a 
new node. It then concatenates T under the new inserted node and 
processes all the nodes pointed by links. Fig. 6(d) shows the final 
common tree for P[ji , |P|]=ACGTATACG, P[j 2 , |P|]=GTATACG, 
P[j 3 , |P|]=ATACG, and P[j 4 , |P|]=ACG. 




Figure 6: An example of constructing a common prefix tree. 

Notice that, the online created suffix paths are only related to 
substrings with the same prefix X[l, q]. We could release the on- 
line created suffix paths when processing another matrix with dif- 
ferent prefix X'[l, q\. 

4.3 A Hybrid Algorithm for Efficiently Reusing 
Score Calculations 

In order to reuse the alignment scores of entries in one fork, in- 
stead of calculating the matrix row by row, we do it in hybird. That 
is, we calculate scores of entries in no gap regions horizontally to 
identify FGOEs with the same row, we then calculate scores of 
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Algorithm 3: HYBRID - reusing score calculations 

Input: A substring X, a query pattern P, alignments A; 
Output: The matrix M x ; 

1 Find positions pos se t = {ti, . . . , tk} of all the occurrences of 
X[l,q]mP; 

2 Initialize a matrix M x using a hash table; 

// process entries in no gap regions 

3 F 3e t = calM atrixByRow(X , P, q, pos se t, M x )\ 

4 while F se t ^ do / / process entries in gap regions 

5 Pop all FGOEs with the same row id from F 3e t and push the 
correponding column ids into a vector F v ; 
calMatrixByColumn(X, P, M x , F v ); 

7 return M x ; 



entries in gap regions vertically to make copy operations between 
forks efficiently. Algorithm 3 is an overview of the HYBRID algo- 
rithm. The HYBRID algorithm first locates positions of all match- 
ing grams of X[l, q] in P using the inverted lists of g-grams for P 
(line 1). It then initializes a hash table to store entries in a matrix 
M x for the substring X and the query P. 

In order to calculate all FGOEs in M x , it invokes a function 
cal Matrix By Row to calculate scores of entries row by row in 
no gap regions until all FGOEs have been found. The FGOEs 
are pushed into a queue F set (line 3). Based on the calculated 
FGOEs, the HYBRID algorithm processes entries in each gap re- 
gion represented by an FGOE column by column. According to 
Lemma 3, only when the FGOEs have the same row ids, their 
gap regions could reuse alignment scores. The algorithm HYBRID, 
therefore, popps all FGOEs with the same row id from F se t and 
pushes them into a vector of FGOEs F v . It then invokes a function 
calMatrixByColumn to reuse scores of entries in gap regions 
until all FGOEs in F set have been processed (lines 4 - 6). It finally 
returns the matrix M x represented by a hash table (line 7). 

Horizontal calculations in no gap regions. We use the function 
cal Matrix By Row to show details of calculations in no gap re- 
gions. It initializes a queue of FGOEs F se t- According to length 
filtering (see Theorem 1), it only processes rows less than or equal 
to L max . For each row id i (1 < i < q), it assigns s a x i to 
M x (i, t + i — 1), where t G pos aet is the starting position of an 
occurrence of in P (lines 6 - 7). These scores could be 

assigned without any calculation according to the (/-prefix filter- 
ing in Section 3.1.3. For each row id i(> q) in no gap regions, it 
calculate scores using Equation 3 (line 9). Notice that, the func- 
tion calMatrixByRow only needs to calculate scores of entries 
(i, t + i — 1) (ti < t < t k ) since only these entries might be 
meaningful. If a score M x (i, t + i — 1) is greater than \s g + s s \, 
it means the corresponding (i, t + i — l)-entry is an FGOE (see 
Section 3.1.3). The function pushes the entry {i, t + i — 1) into the 
queue F set and removes the starting position t from pos set (lines 
10 - 12). If the score M x (i,t + i— 1) does not satisfy score filter- 
ing (see Theorem 2), the function removes t from pos set (lines 13 
- 14). We repeat the above process until pos set becomes empty. 

Vertical calculations in gap regions. In order to reuse scores of 
duplicates in P efficiently, we identify duplicates and reuse align- 
ment scores among gap regions. As the Algorithm HYBRID shows, 
for each iteration, the function calMatrixByColumn processes 
a gap region starting from each position in F v . It first constructs 
a common prefix tree Tp s to identify duplicate substrings using P 
and F v (line 1). For each substring P[j w , jw+i — 1], the function 
does not calculate the score of an entry (i,j) unless it could not 
find a duplicate in the common prefix tree Tp s (lines 2 - 21). 



Function calMatrixByRow 



Input: A substring X, a query P, an integer q, starting positions 
pos se t = {ti, . . . , tfc} of occurrences of X[l, q] in P, a 
matrix M x ; 
Output: A queue of FGOEs F se t; 
Initialize a queue F set to store FGOEs of forks in M x \ 
foreach 1 < i < L max do / / length filtering 
if pos ae t == then 
|^ break; 

foreach position t in pos se t do 
if i < q then 

M x {i,t + !-l) = s Xi; 

else 

M x (i,t + i-l) = 

M(i- l,t + i- 2) + S(X[i],P[t + i- 1]); 

if M x (i,t + i - 1) > \s g + s s \ then 
F se t .push_back(i, t + i — 1); 

pos ae t = pos sat - {t}\ 

if M x (i,t + i — 1) does not satisfy score filtering then 

|_ pos ae t = pos S et ~ {*} ; 
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15 return F set ; 



Before calculating score M x (i,j) in a gap region, the function 
calMatrixByColumn checks if a prefix of P[j w , j w +i — 1] is 
a duplicate of another substring starting at a previous position jh 
O'i < jh < jw) in F v . It finds a path path(r, z) to represent 
P[jw, jw+i — 1] (line 3) and then evaluates each edge in path(r, z) 
(line 5). While an edge edge{u, v) has been processed, the func- 
tion could reuse entries associated with this edge, i.e. it copies each 
score M x (i, v. column + d) to M x (i,j w + d) (lines 7 - 9). The 
function calMatrixByColumn repeats the above iteration until 
it meets an edge that has not been processed. For each remaining 
edge in path(r, z), it calculates a range of row ids [s, e] such that 
for each row id i G [s, e] the (i, j w + d)-entry belongs to the cur- 
rent processing gap region. It calculates the alignment scores for 
each (i,j w + d) -entry and records this range [s, e] so that other 
gap regions could reuse scores of entries in this column (lines 14 - 
18). Notice that, when calculating M x (i,j w + d) in a gap region, 
the function takes the advantage of our vertical calculation. It only 
needs one byte to store the auxiliary score G a (i — l,j w + d) and a 
vector to store the auxiliary score Gb in the (j w +d— l)-th column. 
The size of the vector equals to e — s + 1. These auxiliary scores 
could be released after calculating M x (i, j w + d) . The calculation 
stops when all the entries in the (j w + d)-th column of the current 
gap region are meaningless. It releases Tp g after all positions in F v 
have been processed since Tp s is only used locally (line 22). 

Reuse substrings in text T with the same prefix X[l, q\. In the 
BASIC algorithm, the suffix trie T of T provides an advantage of 
avoiding aligning substrings of T that are identical. When moving 
from a node in T to its child node, a row is added to the calculation 
matrix and when a node is going up to its father node, the last row 
is deleted. Hence, the common prefixes in T are only aligned once. 

In order to combine the above advantage in our hybrid calcula- 
tions, we process paths in T which have the common prefix X [1 , q] . 
For each such path, we identify the forks whose FGOEs no longer 
exist as we are going up along the suffix trie and recalculate the 
FGOEs for them using the function calMatrixByRow. Then, we 
vertically calculate these newly updated gap regions using the func- 
tion calMatrixByColumn. We repeat the above process until all 
paths with the same X[l, q] prefix have been processed. 
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Function calM atrixByC 'olumn 



Input: A substring X, a query P, a matrix Mx , a vector of column 
idsF„; 

r = u = constructCPTree(P, F v ); 

foreach 1 < w < k — 1 do / / process P[jw,jw+i — 1] 

Find a node z such that P[j w , jw+i — 1] can be represented by 
pathir. z); 

Let v be the child node of u in path(r, z)\ 
// Reuse entries in gap regions 
while v. column > do / / the substring 
represented by edge(u,v) has been processed 
d = 0; 

while d < edge(u, v).length() do 

// calculate the (j w + d)-th column 
for i £ v.range[j w + d] do 
|_ Mx(i, jw + d) = Mx(i, v.column + d); 

d + +; 



u = v;v = the child node of u in path(r, z); 

12 repeat// Calculate entries in gap regions 

13 d = 0; 

14 while d < edge(u, v).length() do 

15 Calculate a range of row ids [s, e] such that Vi £ [s, e], 
the (i, ju, + d)-entry belongs to the current gap region; 

16 for i g [s, e] do 

17 |_ Calculate M x (i,j w + d)\ 

18 v.range[j w + d] = [s, e]; 

19 v. columns j w ; 

20 u = = the child node of u in path(r, z); 

21 until a// rfie scores in the (j w + ci)-//i column of the current gap 
region are meaningless; 

22 Release the common prefix tree Tp s rooted at r; 



5. SIMULATING SEARCHES USING COM- 
PRESSED SUFFIX ARRAY 

As discussed in Sections 3 and 4, our technique requires the fol- 
lowing three kinds of searches in the suffix trie T of a text T: 

(1 ) Given a q-length substring S q in the query P, check if S q ap- 
pears in T exactly. In the scenario of this paper, for each suffix path 
in T, the alignment scores of the (i + l)-th row in the matrix M 
depend on the scores of the i-th row in M. Therefore, based on the 
SA range of X, we hope to process the string X' — Xc by itera- 
tively appending one character c behind X. We use the technique 
reported in [8] to construct a compressed suffix array for the rever- 
sal of $T (denoted T _1 ). In T _1 , we do not change the position 
of each character in T and let position of $ be 0. 

Given a g-length substring S q , we search (S 9 ) -1 using the back- 
ward search algorithm [6] on the compressed suffix array and get 
an SA range from SA[i] to SA[j] for (S 9 )" 1 . S q does not appear 
in T only when i < j. This search operation could be done in 0(q) 
steps and each step costs constant time. 

(2) Given a substring X 3 = X[l,i], find its starting positions of 
all occurrences in T. We search X^ 1 and find the SA range from 
cM[:r] to SA [y] for X^ 1 . The position of each appearance of X s 
in T is ySA[/i] — |X S | + 1, where x < h < y. For example, T = 
GCTAGC$, and T _1 = C 6 G 5 A 4 T 3 C 2 G 1 $°, where integers represent 
positions of characters in T. Let |X S | be GC, we search its reversed 
string CG and get the SA range [2, 3]. The suffix array SA[0, 6] = 
{0, 4, 2, 6, 1, 5, 3}, so the positions of the query GC are SA[2] - 
\X S \ + 1 = 1 and SA[3] -\X e \ + l = 5. 

Notice that, in the matrix Mx, we always process af- 
ter X[l, i — 1], Since X[l,i] equates to appending X[i] behind 
X[l, i — 1], we could find the appearances of X[l, i] in T in 0(1) 
time based on the SA range of X[l,i — 1] using the backward 
search algorithm [6]. 



(3) Given a q-prefix X q , traverse the suffix trie and get suffix paths 
whose represented substrings have the same prefix as X q . Let u 
be a node in the conceptual suffix trie and X q be the represented 
substring of the path from the root to u. 

Since we have found the SA range of (X 9 ) -1 using the com- 
pressed suffix array of T _1 , we can check the existence of edge 
with label c from u by computing the SA range for c(X q )~ 1 . We 
enumerate the corresponding substring if the edge c does exist and 
repeat the same procedure to traverse the subtree rooted at u. 

6. ANALYSIS OF NUMBER OF CALCULAT- 
ED ENTRIES 

We consider the general scoring scheme (s a , Sb,s g ,s s ). As we 
have analyzed in Section 3.1, the larger the j^4, j^j, and |j4 are, 
the better performance of local filtering techniques could be. In 
order to understand the behaviors of ALAE deeply, we analyze the 
number of calculated entries. 

We consider the general scoring scheme for any random sub- 
string X d with d characters (d > 1) in the text T. According to 
score filtering, we are interested in each alignment substring P' of 
the query P such that the alignment scores between P' and T are 
positive. For a simple case that an alignment cannot insert a space 
or a gap, let P' contain d characters. We define f{d) to be the 
number of length-(d) substring P' such that score(X d , P') > 0. 

LEMMA 4. When there is no gap between X d and P' , we have 



f{d) < fci(fe) , where ki = (1 - 1 



1 \q/ cr-l ' 



Sf2-n(s-l) 



k 2 = 



( s _l)«-l> ana S - 1 + | Sa |' 



PROOF. Since s a > and s b < 0, when score(X d ,P') > 0, 
the largest number of mismatches is [d/s\. According to prefix fil- 
tering in Theorem 3, mismatches could not appear in either X[l, q] 
orP'[l,q\. 

Therefore, f(d) < (<t - l) l ( d r). Since (?) = ^(V), 

we conclude (V) = { ^%^^^ {f) < (l-^) 9 (f), 

then/(d)<E|l ( ](^l) l (l-^) 9 (f)<(f5|)^-l)7(l-i) 9 (|l|). 

As we know, (f) = , d _ \y it . Using the Stirling's approximation, 
d\ = v / 2^(f) d e A rf, where < \ d < J L-. Therefore, 

d\ v / 2^d(-) d e A <i 



(d- 



Obviously, X d - \ d - 

(f)< 



/2w(d-i) 



^27r(d- i)(^) d - i e A rf-i v / 2^d(j) i e A i 

dd+ ^ e \ d -*d-i-*i m 

i) d - i+ h(i) i+ h 

Xi < 0, then e Xd - Xd - i ~ x * < 1. Thus, 

1 ( d \d+ 1 / d — i \i 



l ,d-± . d 



so-( L |j)<^a(^) a+3 (^) 

— ) d . Therefore, 



y/2Tr(s-l)d 



-)" 



Based on the analysis in [8], the expected total number of calculated 
entries of ALAE is 



+ 



a — ko 



(4) 



BLAST specifies scoring parameters at http://blast.ncbi.nlm.nih.gov/ 
Blastcgi, where {s a ,s b ) £ {(1, -2), (1, -3), (1, -4), (2,-3), 
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(4, 
and 



-1)}. For most of the parameters, 



\Sg 



e {1,2,3,5} 



-5),(1, 

||^| € {1,2}. According to Equation 4, the upper bound 
of the number of calculated entries for DNA sequences can vary 
from 4.50mn 0,520 to 9.05mn ' 896 , and for protein sequences can 
vary from 8.28mn ' 364 to 7.49mn 723 . Notice that both BLAST 
and BWT-SW use (1, —3, —5, —2} as the default scoring scheme 
when aligning DNA sequences. Using this scoring scheme, the 
number of calculated entries using BWT-SW is upper bounded by 
69mn 0,628 (see [8]), whereas using ALAE the number is upper 
bounded by 4.47mn 6038 . 

7. EXPERIMENTS 

In this section, we report our experimental results of ALAE. 
Data sets: We used the following commonly used real data sets, 
including two DNA data sets and one protein data set. The alpha- 
bet size of DNA sequences and protein sequences are 4 and 20, 
respectively. 

Human genome data set. The human reference sequence (GRCh 
37) was assembled from a collection of DNA sequences. 2 It con- 
sists of 24 chromosomes ranging in length from 48 million to 249 
million. We used subsequences with different lengths of GRCh37 
as texts to be aligned with, and the lengths varied from 50 million 
to 1 billion. 

Mouse genome data set. The mouse genome (MGSCv37 chrl) 
was extracted from house mouse that contains 198 million charac- 
ters. 3 Since aligning mouse genomes against human genomes is 
widely used to do homology search in practice [7, 12], we used 
MGSCv37 chrl to generate queries against human genomes. We 
randomly chose 100 starting positions in the first 180 million char- 
acters and picked a fixed length substring from each randomly lo- 
cated starting position to generate a query workload that contains 
100 query sequences with the same query length. We varied the 
query length from 1 thousand to 1 million. We used these query 
workloads to test performance of ALAE. 

Protein data set. We used the comprehensive and non-redundant 
database UniParc 4 that contains most of the publicly available pro- 
tein sequences in the world. We varied the lengths of texts (protein 
sequences) from 10 million to 50 million. We randomly chose se- 
quences from UniParc as queries, ranging in length from 200 to 
100,000. 

Threshold H and i5-value: In our experiments, instead of set- 
ting a threshold value H explicitly, we used an Expectation val- 
ue (a.k.a. iJ-value) that is widely adopted by the biological com- 
munity. The _E-value is a parameter that describes the number 
of alignments one can "expect" to see by chance when search- 
ing a database of a particular size. The following equation re- 
lates S-value and alignment score S: E = Kmne~ xs , where K 
and A are scaling constants computed by BLAST [1]. The corre- 
sponding threshold H for ALAE can be computed as follows [11]: 

H = ^MKmn)-ln(E) ^ Wg ^ {mm 1Q -15 tQ 1Q 

Both BLAST and BWT-SW set E = 10 as the default parameter. 

Scoring scheme: In our experiments, we used the same scoring 
parameters as BLAST (see Section 6) to evaluate the performance 
of ALAE. Both BLAST and BWT-SW adopt (1, -3, -5, -2) as 
the default scoring scheme. Notice that BWT-SW requires that 
\sb\ > 3 1 s 1 , which highly limits its usability. 

2 http://hgdownload.cse.ucsc.edu/goldenPath/hgl8/chromosomes/ 
3 http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/ 
4 ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/ 
uniparc_active.fasta.gz 



Table 2: Comparison of alignment time and number of align- 
ment results when varying lengths of queries (n = 1 billion). 
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Table 3: Comparison of alignment time and number of align- 
ment results when varying lengths of texts (m — 1 million). 
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All the algorithms were implemented using GNU C++. The ex- 
periments were run on a PC with an Intel 2.93GHz Quad Core CPU 
i7 and 8GB memory with a 500GB disk, running a Ubuntu (Linux) 
operating system. 

7.1 Alignment Time and Number of Results 

We compared ALAE with three state-of-art algorithms: Smith- 
Waterman algorithm, BLAST, and BWT-SW 5 using the default set- 
tings of both BLAST and BWT-SW (i.e. (1, -3, -5, -2} and E = 
10). We would not include the Smith-Waterman algorithm into our 
following discussions because this algorithm is too slow to be con- 
sidered. Our experiments show that the Smith-Waterman algorithm 
took 7.7 hours to align a query with 10 thousand characters against 
a text with 50 million characters. However, ALAE only took 25 
ms. For the same reason, we would not report the query perfor- 
mance for the BASIC algorithm since it has higher time complexity 
than the Smith- Waterman algorithm. We conducted experiments to 
show the average time required by ALAE, BLAST, and BWT-SW 
under different circumstances. 

Table 2 shows the average alignment time and the number of 
alignment results when varying the lengths of queries from 1 thou- 
sand to 10 million. We used a 1 billion human genome sequence 
as the text. ALAE shows a great advantage over BWT-SW with 
all queries and can find the results as BWT-SW does. Notice that 
our experiments show that BWT-SW could not align a query with 
more than 1 million characters against the text due to insufficien- 
t memory. ALAE outperformed BLAST when the query lengths 
were less than 10 million. However, when the query length was ex- 
tremely long, such as 10 million, the alignment time was not as fast 
as BLAST. It is worth mentioning that ALAE found more results 
than BLAST did. 

Table 3 shows the average alignment time and number of align- 
ment results when varying the length of a text from 50 million to 
1 billion. We used the query workload, in which each query had 1 
million characters. Table 3 shows ALAE outperforms both BWT- 
SW and BLAST for different text lengths when m = 1 million. 

We also conducted experiments on protein sequences. The re- 
sults for the protein data sets are similar to those presented here. 
For space reason, we omit these results in this paper. 

7.2 Filtering Ratio and Reusing Ratio 

In this section, we show the effectiveness of our proposed filter- 
ing and entry reusing techniques. We use filtering ratio to evaluate 
our filtering techniques compared with BWT-SW: 



Filtering ratio = 



# of filtered entries 



# of calculated entries using BWT-SW 



Xl00% (5) 



Available at http://ixs.hku.hk/ ckwong3/bwtsw 
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Table 4: Number of calculated entries and their computation 
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1,245,288 x 3 


3,735.864 


21,827,128 x 3 


65,481,384 
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Filtered entries are the entries which are calculated using BWT- 
SW but are considered meaningless using ALAE. A higher filtering 
ratio means our filtering techniques are more effective. 

We use reusing ratio to evaluate our entry reusing technique: 



Reusing ratio : 



# of reused entries 
# of accessed entries 



X 100% 



(6) 



Reused entries are the ones whose scores can be simply copied 
from previous calculated entries using ALAE. Accessed entries 
consist of both reused entries and calculated entries. The higher 
the reusing ratio is, the more effective our reusing technique is. 
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Figure 7: Filtering and reusing ratios using (1,-3,-5,-2). 

We conducted experiments with different lengths of queries and 
texts. Figs. 7 shows the filtering ratios and reusing ratios using 
ALAE, when E = 10 and the scoring scheme is (1, —3, —5, —2). 

Fig. 7(a) shows that a query workload with shorter queries main- 
tains a higher filtering ratio. For a fixed text with 100 million char- 
acters, the filtering ratio decreased from 75.3% to 51.8% when the 
query length increased from 1 thousand to 10 million. The reason 
is that when the query length is short, calculated entries are mainly 
belonging to no gap regions. In a no gap region, ALAE could filter 
more meaningless entries compared with BWT-SW. 

Fig. 7(b) shows when the query length increases from 10 thou- 
sand to 10 million, the reusing ratio increases from 16.2% to 31.5%. 
This is because longer queries contain more repetitions and more 
entries can be reused during the alignment process. When the query 
length is 1 thousand, the reusing ratio is very low since it is hard to 
find duplicates among forks. 

Figs. 7(c) and 7(d) show both the filtering ratio and reusing ratio 
keep stable when changing the length of a text for a fixed query 
workload. The reason is that ALAE only considers a substring with 
small length L max to do alignment against the query. 



ALAE not only reduces the number of entries that need to be 
calculated by BWT-SW, but also optimizes the cost for computing 
scores. Table 4 shows how the number of calculated entries trans- 
lates into the computation cost. Using ALAE, calculated entries 
could belong to a no gap region or a gap region in each matrix. In a 
no gap region, ALAE uses the simplified recurrence function (see 
Equation 3) to calculate score for each entry, whereas BWT-SW has 
to consider the auxiliary scores in both G a and Gb, which requires 
extra computation costs. Similarly, the entries in the boundaries of 
the fork areas using ALAE only rely on their two adjacent entries 
instead of three adjacent entries using BWT-SW. 
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(a) Filtering ratio (n=l billion), (b) Reusing ratio (n=l billion). 
Figure 10: Filtering and reusing ratios using different scores. 

7.3 Effect of e- Values 

In this section, we examine how ALAE would be affected by E- 
values. Fig. 8 shows the alignment time of ALAE when varying 
E from 10~ 15 to 10 using three different query workloads. We 
can see that ALAE is not very sensitive to .E-values. For the query 
workload that contains queries with 10, 000 characters, the align- 
ment time is 72ms when E is 10~ 15 , 72.9ms when E is 10~ 5 , and 
79.9ms when E is 10. The time is too small to be seen in this fig- 
ure. For any given query workload, ALAE shows small time rises 
when we increase E. The reason of these time rises is that a large 
H value (i.e. small E value) terminates calculations earlier than a 
small H value. Notice that such rises are very small since score 
filtering only shares a small impact on accelerating alignment time. 

7.4 Effect of Scoring Schemes 

In order to test the effect of scoring schemes, we chose four 
scoring schemes in BLAST by varying values s a , Sb, s B , and s 3 . 

Fig. 9 shows the four representative scoring schemes that cover 
large and small values of q and \s g \ + \s s \. Both ALAE and BWT- 
SW are sensitive to scoring schemes, whereas BLAST is barely 
affected by the change of scoring schemes, because BLAST adopts 
a different heuristic approach to find results. 

ALAE runs much faster than BWT-SW for all scoring schemes. 
Notice that we do not include the result of BWT-SW for (1,-1, 
-5, -2} since BWT-SW requires that \s b \ > 3\s a \. ALAE shows 
good performance when the scoring scheme is (1, —3, —5, —2) or 
(1, -4, -5, -2). ALAE is 119 times faster than BWT-SW using 
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Table 5: Number of entries using ALAE. 

Scoring schemes | # of reused entries | # of accessed entries | # of calculated entries 



< 1,-1,-5, -2 > 


30,652,400 


380,960,680 


350,308,280 


< 1,-3,-2,-2 > 


19,047,958 


124,804,117 


105,756,159 



(1, —3, —5, —2) and 65 times faster using (1, —4, —5, —2). This is 
because a larger J^4 makes L m „ much tighter, a smaller s a makes 
calculation terminating earlier, and a larger q value (see Equation 2) 
makes prefix filtering more effective. 

Fig. 9 shows that ALAE is slower than BLAST when the scoring 
scheme is (1, — 1, —5, —2). We use Fig. 10 to explain the reason. 
The small sj, value makes gap regions expanded, which results in 
large number of calculated entries. Table 5 shows that the num- 
ber of calculated entries using (1, — 1, —5, —2) is 37 times of the 
number using (1, —3, —5, —2). We can see that the reusing ratio 
is much lower than other scoring schemes. This result is consistent 
with the analysis in Section 6, where (1, — 1, —5, —2} corresponds 
with the worst case where the number of calculated entries is up- 
per bounded by 9.05mn°' 896 . For (1, -3, -2, -2), the smaller 
\s g | + | s s j value makes the no gap regions smaller, which weakens 
the effect of filtering techniques (see Fig. 10(a)). 

7.5 Index Size 

We have evaluated the space efficiency of ALAE for both DNA 
sequences and protein sequences. We varied the length of texts and 
collected their index sizes. We used the following scoring schemes: 
(1, —3, —5, —2) for DNA sequences, and (1, —3, —11, —1) forpro- 
tein sequences. Fig. 1 1(a) shows the index sizes of ALAE for DNA 
sequences. The alphabet size of the DNA sequences is 4, thus ev- 
ery character in BWT sequence can be stored using 2 bits. As we 
can see in Fig. 1 1(a), the indexes for storing dominate relationships 
of DNA sequences are mostly too small to be seen. 

Figs. 1 1(b) shows the index sizes of ALAE for protein sequences. 
For relatively small texts, the index for storing dominate relation- 
ships is large compared with BWT index. However, as the size of 
a text grows, the size of the dominate index decreases quickly. The 
dominate index size is 98.09MB for a text with 10 million charac- 
ters and decreases to 8.83MB when the length of the text increases 
to 20 million. 



j 2000 
I 

1000 ■ 





Size of a text (M) 

(a) DNA sequences 



10 20 30 40 50 
Size of a text fM) 



(b) Protein sequences. 
Figure 11: Index size. 

8. CONCLUSION AND FUTURE WORK 

We have developed a novel approach ALAE to accelerate dy- 
namic programming for finding all local alignments. We gave a 
full analysis of the dynamic programming approach, and present- 
ed a series of filtering techniques to prune meaningless entries and 
an algorithm to reuse duplicate calculations. Our extensive experi- 
ments on real biosequences showed the high efficiency of our tech- 
niques. ALAE improves the time efficiency of the state-of-the-art 
exact BWT-SW approach significantly and accelerates BLAST for 



most of the scoring schemes. As parts of future work, we will inves- 
tigate techniques to further improve the performance of ALAE for 
all scoring schemes and exploit algorithms using external memory. 
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